04_describe

Load Data

#|message: false
data <- read_tsv("./../data/03_dat_aug.tsv",
                 show_col_types = FALSE)

Data Description

This dataset comprises a comprehensive registry detailing immunological and demographic characteristics, encompassing 216 observations across 41 variables. It captures data from 54 (29 cases and 25 control) patients across four distinct visits, including demographic details like gender, ethnicity, race, and age at diagnosis and visit. It encompasses key immunological markers such as GAD65, IA_2, IAA, ZnT8 antibodies, along with extensive HLA allele information (HLA_A, HLA_B, HLA_C, HLA_DPA1, HLA_DPB1, HLA_DQA1, HLA_DQB1, HLA_DRB1). Additionally, it contains metrics (e.g., total_templates, total_rearrangements, productive_templates) about the T-cell receptor β region, crucial for diversifying T-cell receptors and recognizing antigens. This dataset elucidates the intricate relationship between demographic factors and immunological profiles across multiple patient visits, providing invaluable insights for in-depth analysis and comprehension. Refer to the complete list of variables provided below.

Variables

data |>
  tbl_vars()
<dplyr:::vars>
 [1] "sample_name"                  "total_templates"             
 [3] "productive_templates"         "fraction_productive"         
 [5] "total_rearrangements"         "productive_rearrangements"   
 [7] "productive_simpson_clonality" "max_productive_frequency"    
 [9] "locus"                        "sku"                         
[11] "test_name"                    "case_control"                
[13] "id"                           "timepoint"                   
[15] "gender"                       "ethnicity"                   
[17] "race"                         "age"                         
[19] "age_diagnosis"                "age_visit"                   
[21] "antibody"                     "expression"                  
[23] "allele"                       "subtype"                     

Age

data |> 
  drop_na(gender, age, case_control) |>
  select(gender, age, case_control, id, timepoint) |> 
  distinct() |> 
  mutate(age = case_when(age < 5 ~ factor("(0,5]"),
                         age < 10 ~ factor("(5,10]"),
                         age < 15 ~ factor("(10,15]"),
                         age < 20 ~ factor("(15,20]"),
                         age < 25 ~ factor("(20,25]"))) |> 
  filter(!is.na(age)) |> 
  ggplot(mapping = aes(x = age,
                       fill = case_control)) +
  geom_bar(alpha = 0.8,
           position = position_dodge(preserve = "single"),
           color = "black") +
  theme_bw() +
  labs(title = "Age group stratified by case/control per timepoint",
       x = "Age",
       y = "Count",
       fill = "Case/Control") +
  scale_fill_manual(values = c(Case = "#481567",
                               Control = "#20A387")) +
  facet_wrap(~timepoint) +
  theme(legend.position = "bottom")

ggsave(width = 8,
       height = 3,
       filename = "04_plot_1.png",
       path = "./../results/")

data |>
  drop_na(age_diagnosis) |>
  ggplot(data = _,
         mapping = aes(x = age_diagnosis)) +
  geom_boxplot(color = "#481567",
               fill = "#20A387",
               alpha = 0.8,
               size = 0.5) +
  theme_bw() +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  labs(title = "Age at diagnosis",
       x = '')

ggsave(width = 8,
       height = 3,
       filename = "04_plot_2.png",
       path = "./../results/")

The age distribution of patients predominantly falls within the 0-5 age group. Notably, more than half were diagnosed between ages 10 and 15. The participants underwent testing from early childhood into their twenties. Among the 29 individuals, only 5 received diagnoses after age 15, while 24 were diagnosed before reaching that age. Given the data-set is limited size of 29 observations, the analysis results might diverge from the actual pattern.

TCR-Beta diversity

Total rearrangement

data |>
  group_by(case_control, age_visit, timepoint) |>
  summarize(total_temp_mu = log(mean(total_rearrangements))) |>
  ggplot(data = _,
         mapping = aes(x = age_visit,
                       y = total_temp_mu)) +
  geom_point(mapping = aes(color = timepoint)) +
  labs(title = "Mean of Total Rearrangements vs. Age at visit",
       subtitle = "Faceted by outcome and stratified by timepoint",
       x = "Age",
       y = "log( Total Rearrangement )",
       color = "Timepoint") +
  scale_color_viridis() +
  geom_smooth(method = "lm",
              color = "black") +
  facet_grid(vars(case_control)) +
  theme_bw()

ggsave(width = 8,
       height = 6,
       filename = "04_plot_3.png",
       path = "./../results/")

Productive rearrangement

data |>
  group_by(case_control, age_visit, timepoint) |>
  summarize(prod_temp_mu = log(mean(productive_rearrangements))) |>
  ggplot(data = _,
         mapping = aes(x = age_visit,
                       y = prod_temp_mu)) +
  geom_point(mapping = aes(color = timepoint)) +
  labs(title = "Mean of Productive Rearrangements vs. Age at visit",
       subtitle = "Faceted by outcome and stratified by timepoint",
       x = "Age",
       y = "log( Productive Rearrangement )",
       color = "Timepoint") +
  scale_color_viridis() +
  geom_smooth(method = "lm",
              color = "black") +
  facet_grid(vars(case_control)) +
  theme_bw()

ggsave(width = 8,
       height = 6,
       filename = "04_plot_4.png",
       path = "./../results/")

We see a slight rise in total and productive rearrangements for case patients and a decrease for control patients, when comparing the values against age at visit. So the number of rearrangements seems to be stable in case patients during aging, however there is decreasing rearrangement for the control patients.

Fraction productive rearrangement

data |>
  group_by(case_control,
           age_visit,
           timepoint) |>
  summarize(frac_prod = mean(productive_rearrangements) / mean(total_rearrangements)) |>
  ggplot(data = _,
         mapping = aes(x = age_visit,
                       y = frac_prod)) +
  geom_point(mapping = aes(color = timepoint)) +
  labs(title = "Fraction of Productive Rearrangements vs. Age at visit",
       subtitle = "Faceted by outcome and stratified by timepoint",
       x = "Age",
       y = "Fraction of Productive Rearrangements",
       color = "Timepoint") +
  scale_color_viridis() +
  geom_smooth(method = "lm",
              color = "black") +
  facet_grid(vars(case_control)) +
  theme_bw()

ggsave(width = 8,
       height = 6,
       filename = "04_plot_5.png",
       path = "./../results/")

We see a steep rise in the fraction of productive rearrangement with age (regardless of outcome for patients), which must mean that the differences in values between productive rearrangement and total rearrangement become smaller with age.

Productive Simpson Clonality

data |> 
  ggplot(data = _,
         mapping = aes(x = age_visit,
                       y = log(productive_simpson_clonality))) +
  geom_point(mapping = aes(color = timepoint)) +
  labs(title = "Productive Simpson Clonality vs. Age",
       subtitle = "Faceted by outcome and stratified by timepoint",
       x = "Age",
       y = "log( Productive Simpson Clonality )") +
  scale_color_viridis() +
  geom_smooth(method = "lm",
              color = "black") +
  facet_grid(vars(case_control)) +
  theme_bw()

ggsave(width = 8,
       height = 6,
       filename = "04_plot_6.png",
       path = "./../results/")

Both case and control show a direct proportionality between productive simpson clonality and age, which shows the decrease of polyclonality with age.